[llama] Store KV Cache on CPU and Use PyTorch SDPA for Next token generation #1182

Open
wants to merge 4 commits into
base: main


@zhentaoyu zhentaoyu commented Aug 2, 2024

What does this PR do?

[image: pipeline diagram]

Results

python run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf --max_new_tokens 4096 --bf16 --use_kv_cache --attn_softmax_bf16 --reuse_cache --do_sample --prompt "Tell me somethings about Intel"

  • with --kv_cache_on_host
Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 2.132539697795915 tokens/second
Number of HPU graphs                = 14
Memory allocated                    = 12.68 GB
Max memory allocated                = 12.77 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 5842.699780527037 seconds
--------------------------------------------------------------------------------------------------------------

update 4b0fa1a

Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 12.22449896564133 tokens/second
Number of HPU graphs                = 0
Memory allocated                    = 12.68 GB
Max memory allocated                = 12.68 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 1010.5770402610069 seconds
--------------------------------------------------------------------------------------------------------------
  • without --kv_cache_on_host
Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 31.41817953959749 tokens/second
Number of HPU graphs                = 11
Memory allocated                    = 14.68 GB
Max memory allocated                = 14.68 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 397.36551256105304 seconds
--------------------------------------------------------------------------------------------------------------
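As a rough sanity check on the difference in allocated memory between the two runs above (14.68 GB vs. 12.68 GB), the KV cache size can be estimated from the model shape. The sketch below assumes the published Llama-2-7B configuration (32 layers, 32 KV heads, head dimension 128) and 2-byte bf16 elements. The helper name `kv_cache_bytes` is hypothetical and not part of optimum-habana, and the estimate ignores the prompt tokens and any allocator overhead.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes (hypothetical helper, not a library API).

    Defaults are assumed Llama-2-7B values with bf16 (2 bytes per element).
    """
    # One key vector and one value vector per layer, per KV head, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return batch_size * num_tokens * per_token


GIB = 1024 ** 3
size = kv_cache_bytes(4096)  # --max_new_tokens 4096, prompt length ignored
print(f"{size / GIB:.2f} GiB")  # prints "2.00 GiB"
```

Under these assumptions the cache for 4096 generated tokens is about 2 GiB, which appears consistent with the roughly 2 GB of device memory freed by --kv_cache_on_host, although whether the reported "GB" figures are decimal or binary units is not stated.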

Limitations

  • Results are incorrect when --use_hpu_graphs is enabled, because the self-attention forward pass performs host-device memory transfers.

cc @airMeng and @luoyu-intel

Update

Yi-34B-Chat on Gaudi2 with ~11k input + 5k output tokens
Command:

python run_generation.py \
--model_name_or_path 01-ai/Yi-34B-Chat \
--use_kv_cache \
--bf16 \
--attn_softmax_bf16 \
--reuse_cache \
--do_sample \
--dataset_name emozilla/pg19-test \
--batch_size 1 \
--max_input_tokens 11200 \
--column_name "text" \
--dataset_max_samples 1 \
--warmup 0 \
--n_iterations 1 \
--max_new_tokens 5000 \
--kv_cache_on_host
  • without kv_cache_on_host:
 09/18/2024 05:28:11 - INFO - __main__ - Graph compilation...
Traceback (most recent call last):
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 707, in <module>
    main()
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 655, in main
    generate_dataset(batch)
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 633, in generate_dataset
    outputs = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1299, in generate
    result = self._sample(
  File "/data/optimum-habana/optimum/habana/transformers/generation/utils.py", line 2239, in _sample
    self.htcore_generation.mark_step()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/utils/internal.py", line 26, in wrapper
    func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/step_closure.py", line 66, in mark_step
    htcore._mark_step(device_str, sync)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_SYNHELPER workspace Allocation of size ::28127918336 failed!
  • with kv_cache_on_host:
Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 1.2790787964372536 tokens/second
Total runtime for dataset: 3909.073683977127
Memory allocated                    = 90.72 GB
Max memory allocated                = 91.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3907.185397926951 seconds
----------------------------------------------------------------------
  • larger output token count with kv_cache_on_host:
    --max_input_tokens 11200 --max_new_tokens 10000
Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 1.2790787964372536 tokens/second
Total runtime for dataset: 3909.073683977127
Memory allocated                    = 90.72 GB
Max memory allocated                = 91.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3907.185397926951 seconds
----------------------------------------------------------------------

@airMeng

airMeng commented Aug 2, 2024

@hshen14 @luoyu-intel for awareness

@airMeng

airMeng commented Aug 7, 2024

@mandy-li @libinta @dvarshney-habana This is the first system-optimization PR from the Intel Neural Compressor (INC) team. Could you review it?

Experiments with Llama2 on a single Gaudi2 card with a Xeon 8380 host. By offloading the KV cache and SDPA to the CPU, we raise the context limit from 26k (input: 10k + output: 16k) to 310k (input: 10k + output: 300k).

| Config | Context | HPU Memory (GB, steady/peak) | CPU Memory (GB) |
|---|---|---|---|
| KV cache on HPU | 10k+16k | ~90 | N/A |
| KV cache on HPU | 10k+100 | 83.36/84.11 | 4.4 |
| KV cache on HPU | 12k+100 | 91.78/92.72 | 5.03 |
| KV cache on HPU | 12k+10k | 92.06/93.0 | 7.68 |
| KV cache on HPU | 12k+100k | OOM | N/A |
| KV cache on HPU | 10k+100k | 86.22/86.97 | 31 |
| KV cache on HPU | 10k+300k | 91.94/92.70 | 85 |

@emascarenhas
Contributor

Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

@zhentaoyu
Author

> Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

Done.

@imangohari1
Contributor

imangohari1 commented Sep 10, 2024

@zhentaoyu
Thanks for the PR and the results in the description.
Do I read this correctly that using the KV cache on host degrades throughput and also does not generate correct answers with HPU graphs? If so, what is the use of this option?

This PR also has a merge conflict with main; could you please take a look at the differences?
We need to test this PR with the CI system to make sure it does not break anything or impact performance.

@zhentaoyu
Author

> @zhentaoyu Thanks for the PR and the results in the description. Do I read this correctly that using the KV cache on host degrades throughput and also does not generate correct answers with HPU graphs? If so, what is the use of this option?
>
> This PR also has a merge conflict with main; could you please take a look at the differences? We need to test this PR with the CI system to make sure it does not break anything or impact performance.

  1. Yes. It's an option for long-context inference or generation when a single HPU card runs out of memory. In this PR, I just use torch.Tensor.to to transfer the KV-cache-related tensors between the CPU and Gaudi2, and run the next-token SDPA on the CPU to save data transfer time. However, it cannot generate the right answer with --use_hpu_graphs. I'm not familiar with the Habana Synapse graph; please tell me if you have any insights, and I'll be happy to try to fix it.
  2. OK, I have rebased the PR.

@zhentaoyu
Author

Hi @imangohari1, I have updated the PR (see the description). Could you please take another look when you have time? Please let me know if you have more comments or need more tests. Thanks a lot.
cc @hshen14

else:
    unwrap_deepspeed_model(self).allocate_kv_cache(
        bs * generation_config.num_beams, calculated_max_length, token_idx + num_virtual_tokens
    )
Collaborator


I suggest changing lines 1096 to 1107 as follows.

if not is_greedy_or_beam_and_bucket:
    cache_device = "hpu"
    if generation_config.kv_cache_on_host and self.config.model_type in ["llama"]:
        print("Allocate KV Cache on CPU...")
        cache_device = "cpu"
    unwrap_deepspeed_model(self).allocate_kv_cache(
        bs * generation_config.num_beams, calculated_max_length, token_idx + num_virtual_tokens,
        device=cache_device
    )

Author


Thanks, I have updated it in 74e94ff. However, I cannot remove the else line because I only modified modeling_llama.py for this experimental feature.

@yeonsily
Collaborator

@zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM."?
The README example is Llama 7B, and we don't see an advantage for this run. It would be good to include a real example.

@zhentaoyu
Author

> @zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM."? The README example is Llama 7B, and we don't see an advantage for this run. It would be good to include a real example.

Hi @yeonsily, thanks for your comment. Yes, I have added a case to the README and updated the results in the PR description.

else:
with ht.sdp_kernel(enable_recompute=flash_attention_recompute):
else:
if kv_cache_on_host:
Collaborator


Can you please explain in which case the kv_cache device is switched? I thought line 656 applies only in the case of line 658.

Author


In this PR, we store the KV cache on the CPU and run SDPA on the CPU only when generating the next token. The first-token (prefill) stage is performed on the HPU because of its strong compute capability in long-context scenarios (a long prompt in most cases). The full pipeline diagram is shown in the PR description.
So line 658 tells the machine to run PyTorch CPU SDPA (flash attention) only when kv_cache_on_host is set, during next-token generation, and in inference mode. Otherwise, it transfers the KV cache to the HPU device if needed for its original operations.
Please let me know if you need more explanation or have suggestions. Thanks.
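The routing rule described in this reply can be sketched as a small pure-Python function. This is only an illustration under stated assumptions: the function name `attention_device` is hypothetical, and treating a query length of 1 as the marker for next-token generation is an assumption, not necessarily how modeling_llama.py detects the decode step.

```python
def attention_device(kv_cache_on_host: bool, q_len: int, training: bool) -> str:
    """Pick where attention runs for one forward call (hypothetical helper)."""
    # Assumption: a single query token means next-token (decode) generation.
    is_next_token = q_len == 1
    if kv_cache_on_host and is_next_token and not training:
        # The KV cache already lives on the host, so run SDPA there
        # instead of copying the whole cache to the device.
        return "cpu"
    # Prefill, training, or the option disabled: keep the original HPU path.
    return "hpu"


print(attention_device(True, q_len=11200, training=False))  # prefill
print(attention_device(True, q_len=1, training=False))      # next token
```

With the option enabled, the long prompt is still processed on the HPU, and only the single-token decode steps are routed to the CPU.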

@airMeng

airMeng commented Oct 29, 2024

> @zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM."? The README example is Llama 7B, and we don't see an advantage for this run. It would be good to include a real example.

@yeonsily A similar feature is already available in TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/kv_cache_reuse.html#offloading-to-host-memory
